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Abstract 

In this paper, we present a GPU-accelerated direct-sum boundary integral method to solve the 
linear Poisson-Boltzmann (PB) equation. In our method, a well-posed boundary integral formu- 
lation is used to ensure the fast convergence of Krylov subspace based linear algebraic solver such 
as the GMRES. The molecular surfaces are discretized with fiat triangles and centroid colloca- 
tion. To speed up our method, we take advantage of the parallel nature of the boundary integral 
formulation and parallelize the schemes within CUDA shared memory architecture on GPU. The 
schemes use only HiV + 6N C size-of-double device memory for a biomolecule with N triangular 
surface elements and N c partial charges. Numerical tests of these schemes show well-maintained 
accuracy and fast convergence. The GPU implementation using one GPU card (Nvidia Tesla 
M2070) achieves 120-150X speed-up to the implementation using one CPU (Intel L5640 2.27GHz). 
With our approach, solving PB equations on well-discretized molecular surfaces with up to 300,000 
boundary elements will take less than about 10 minutes, hence our approach is particularly suitable 
for fast electrostatics computations on small to medium biomolecules. 
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1 Introduction 



Molecular mechanics uses Newton's classical mechanics to model molecular systems. The potential 
energy of all systems in molecular mechanics is calculated using force fields. Among all components 
of the force fields, electrostatics are critical due to their ubiquitous existence and are expensive 
to compute since they are long-range pairwise interactions. Poisson-Boltzmann (PB) model is an 
effective alternative for resolving electrostatics that includes energy, potential and forces of solvated 
biomolecules pp. As an implicit solvent approach, the PB model uses a mean field approximation to 
trace the solvent effects and applies Boltzmann distribution to model the mobile ions. These implicit 
treatments make the PB model computationally more efficient compared to explicit solvent models, 
in which atomic details of solvent molecules and mobile ions are explicitly described. 

In the PB model, the computational domain R 3 is divided into the solute domain f^i and the 
solvent domain Q2 by a closed molecular surface T such that IR 3 = f^i Uf^UT. The molecular surface 
r is formed by the traces of a spherical solvent probe rolling in contact with the van del Walls balls 
of the solute atoms [21 IB]- The molecule, which is located in domain Qi with dielectric constant £\, is 
represented by a set of N c point charges carrying Qi charge in the units of e c , the elementary charge, 
at positions Xj,i = 1, ...,N C . The exterior domain contains the solvent with dielectric constant £2, as 
well as mobile ions. For x = (x,y,z), the PB equation for the electrostatic potential in each domain 
is derived from Gauss's law and the Boltzmann distribution. Assuming weak ionic strength (e.g., the 
concentration of the physiological saline in a room temperature), the linearized PB equation and its 
interface jump conditions and boundary conditions have the forms 

N c 

V- (ei(x)V0i(x)) = -^fc<y(x-xO in fii, (1) 

i=l 

V-(£ 2 (x)V0 2 (x))-k 2 <Mx) = O hi n 2 , (2) 

0i(x) = 2 (x), £1 „ = £ 2^ on T = dQi = dn 2 , (3) 

ov ov 

lim 02 (x) = 0, (4) 

|x|— too 

where <p\ and <p2 are the electrostatic potentials in each domain, qi = e c Qi/kBT,i = 1,...,N C , e c the 
electron charge, ks the Boltzmann's constant, T the absolute temperature, 5 the Dirac delta function, 
k the Debye-Hiickel parameter, and v the unit outward normal on the interface T. Note the k 2 0(x) 
term in Eq. ([2]) is the linearized form of k 2 sinh(0(x)) when weak ionic strength is assumed. 

The PB equation is an elliptic equation defined on multiple domains with discontinuous coefficients 
across the domain interface. The PB equation has analytical solutions only for the simple geometries 
such as spheres [3] or rods [5]. For molecules with complex geometries, the PB equation can only 
be solved numerically, which is challenging due to the non-smoothness of the solution subject to 
Eq. ([3]), the complex geometry of the interface T, the singular partial charges in Eq. ([]]), and the 
boundary conditions at the infinity in Eq. @. Many numerical PB solvers were developed and they 
can be roughly divided into two categories: 1) the 3-D mesh-based finite difference/finite element 
methods [3 El El QUI EE] ; and 2) the boundary integral methods [T21 H3 H31 HSl HSl HZl UHl USl 
l20l I21j . All these methods have their own advantages and disadvantages. For example, the PB 
solvers embedded in molecular modeling packages, such as Dephi [6j, CHARMM [7], AMBER [8], and 
APBS [9] use standard seven-point finite difference to discretize the PB equation. Although standard 
finite difference methods arguably result in reduced accuracy due to the smoothened treatment of 
Eq. (J3]), the efficient, robust and user- friendly features of these PB solvers brought their popularity 
to the computational biophysics/biochemistry community. Compared with 3-D mesh-based methods, 
the boundary integral methods have advantages since they impose the singular partial charge in 
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Eq. (HJ) and the far-field boundary condition in Eq. d3|) exactly, use appropriate boundary elements to 
represent surface geometry to desired precision, and enforce the continuity condition in Eq. across 
the interface explicitly They are naturally more convenient methods to attack numerical difficulties 
in solving the PB equation. 

An often claimed advantage of boundary integral methods compared to finite difference methods 
is the reduction of the 3-D differential equation to a 2-D surface integral equation. However, the 
finite difference methods generate a 7-band sparse matrix, while the boundary integral methods 
constitute a fully dense matrix, which is prohibitively expensive to store and whose matrix-vector 
product is computationally costly. The remedy is to generate the matrix on-the-fly and compute 
the matrix-vector product with the assistance of fast algorithms for TV-body problems, such as Fast 
Multipole Methods (FMM) [E2 EE E2 EI] and treecode [T7] . In the last few decades, the advent 
of the multicore computers and related parallel architectures such as MPI and Open MP brought 
the boundary integral methods to a superior position to 3-D mesh-based methods. Recently, the 
appearance of GPU computing further boosts the boundary element methods, particularly for schemes 
that have simpler algorithms and use less memory. 

GPU computing refers to the use of graphics processing units for scientific and engineering com- 
puting applications rather than traditionally rendering graphs. A GPU card can be treated as an 
array of many simplified CPUs with reduced but concurrent computing power and more limited but 
faster memory access to CPUs. GPUs execute many concurrent threads relatively slowly, rather than 
a single thread quickly. This means that GPU computing is more suitable for problems with high 
concurrency, straightforward workflow, low memory requirements, and infrequent communication. 
GPUs have broad areas of application, particularly in speeding up molecular modeling. Molecular 
dynamics packages such as AMBER [22l E3] and NAMD [Ml ES] have GPU implementations that 
achieve significant speedup compared to CPU implementations. 

The matrix-vector product computed in boundary integral formulation is similar to comput- 
ing iV-body problems for particle interactions. Solving iV-body problems and related applications 
in boundary integral methods are popular targets for GPU computing. For examples, Nyland et. 
al [26j computed iV-body interactions by direct summation, Burtscher et. al [27] used the Barnes- 
Hut treecode [28J to compute the dynamics of 5 million point masses on a 1.3GHz GPU with 240 
threads, obtaining a speedup of 74 over a 2.53 GHz CPU, and Yokota et. al [29] demonstrated a FMM 
accelerated boundary integral PB solver on 512 GPUs, achieving 34.6 TFlops. 

In this paper, we present a parallel boundary integral PB solver on GPUs. We adopt the well-posed 
formulation from Juffer et al. |12j, rather than the straightforward integral formulation presented in 
[30] and applied in [15} I29j . The N-body summation is computed directly, therefore the scheme 
is implementation convenient and memory saving for GPU computing. The rest of the paper is 
organized as follows. In section 2, we provide our algorithms including the well-posed boundary 
integral formulation and its discretization, followed by CUDA implementation on GPUs. In section 
3, we present the numerical results, first on the spherical cavities with multiple partial changes, whose 
analytical solutions are available and then on a series of proteins with various sizes and geometries. 
This paper ends with a section of concluding remarks. 

2 Methods 

We use the well-posed boundary integral formulation from Juffer's work |12j together with a flat 
triangulation and a centroid collocation. A factor affecting the accuracy of boundary integral method 
is the discretization of the surface. We use a non-uniformed triangular surface from MSMS [31j . 
Throughout this paper, we call our GPU-accelerated boundary integral Poisson-Boltzmann solver as 
GABI-PB solver. 
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2.1 Well-posed integral formulation 



The differential PB equations ([T]) and ([2]) can be converted to boundary integral equations. In 
order to do this, we first define the fundamental solutions to the Poisson equation ([1]) in f^i and the 
fundamental solution to the PB equation ([2]) in Q 2 as 



C (x,y) 



47r|x — y| 



G*(x,y) 



-«|x-y| 



4?r x 



(5) 



Note Go(x, y) and G K (x, y) are called Coulomb and screened Coulomb potentials in electrostatic 
theory. By applying Green's second theorem, and canceling the normal derivative terms with interface 
conditions in Eq. ([3]), the coupled integral equations can be derived as [30] : 



<Mx) = 



' , ,dMy) dG (x,y) 
G (x,y)^ 0i (y) 



k=l 



n i ,d<fo(y) . <9G K (x,y) 
-G K (x,y) — + 2 (y) 



dS, 



x e ri 2 . 



(6) 



(7) 



However, straightforward discretization of Eqs. ([6]) and ([7]) yields an ill-conditioned linear system, 
whose condition number dramatically increases as number of boundary elements increases [32] . Juffer 
et al. derived a well-posed boundary integral formulation by going through the differentiation of 
the single-layer potentials and the double-layer potentials [12j . Here the single-layer potentials are 
from the induced point charge distributions Go and G K on surface T, while the double-layer potential 
are from the induced dipole charge distributions, which are the normal derivatives of Go and G K on 
surface V. The desired integral forms are as: 



1 



1 + e) 01 (x) 



1 + 



1 



-(x) 



di 



^i(x,y)^^ + K 2 (x,y)<My) 
^3(x,y)^^ + K 4 (x,y)<My) 



dSy + S'i(x), x e r, 

dSy + Sk(x), x G r, 



(8) 
(9) 



with the notation for the kernels 

^ 1 (x,y)=G (x,y)-G ft (x,y), 

9G (x,y) 10G„(x,y) 



^2(x,y) = e 



9G K (x,y) 0G o (x,y) 



K 3 (x,y) 



di 



^ 4 (x,y) 



<9 2 G K (x,y) <9 2 G (x,y) 



Si ( x ) = J2 qkG °^ yk ^ 



k=l 



Q / s, <9G (x,y fc ) 
ft(x) = ^ gfc ■ 



(10) 



fc=i 



and e = £i/£ 2 - Note this is the well-posed Fredholm second kind of integral equation, which is also 
our choice in this paper. 



2.2 Discretization 

We discretize the integral equations (JSj) and © with the flat triangle and the centroid collocation 
(the quadrature point is located at the center of each triangle). This scheme also assumes that the 
potential and its normal derivative, as well as the kernel functions are uniform on each triangle. 
When the singularities in kernels occur (x = y), the contribution of this triangle in the integral is 
then simply removed. This scheme, which provides the convenience of incorporating fast algorithms, 
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such as FMM [32] and treecode [17] , is in fact widely used in the latest boundary integral methods 
in solving PB equations. 

In this paper, suppose the triangulation program MSMS [3T] discretizes the molecular surface to 
a set of N triangular elements connected with N v vertices and N e edges. We then have the relation 
N v + N — N e = 2 called Euler's polyhedron formula. With potential 0i(x) and its normal derivative 
^q^ x ^ at the centroid of each triangular element as the unknowns, Eqs. ([8]) and ([9]) are converted to 
a linear algebraic system Au = b, whose elements are specified as follows. Note we do not explicitly 
express A in an iterative method as we only concern Au on each iteration. 

For i = 1, 2, N, the ith and the (i + iV)th element of the discretized matrix- vector product Au 
are given as 



{Au}; = i(l +£ )^( Xi )- W 3 ^l(x 4 ,X j )^^ + K 2 (x J ,X J )0 1 (x J ) 



1 / , 1\ &£i(xi) 



{Au}^ = - 1 + - ^L- ^ W, 



N 



3 hj/i 



K 3 (x.i,Xj)— — J — +2f 4 (x i ,Xj)^i(xj 



(12) 
(13) 



where Wj is the area of the jth element. Note we use Xj and Xj (instead of y~j) in all kernels to 
indicate sources and targets are the same set of points. The expressions of b; and bj+^v are directly 
obtained from Si and 5*2 in Eq. (Illh . In producing the results of this paper, we apply GMRES solver 
|33j to solve the linear algebraic system from the discretization of Eqs. (J8j) and Q, which requires 
computing the matrix-vector product in Eqs. (|12f) and (|13p on each iteration. 

2.3 Electrostatic solvation energy formulation 

To perform the solvation analysis of an interested biomolecules, the electrostatic solvation energy is 
computed by 



h N c 1 N c 

fc=l fc=l 



Ki(x k ,y)^±+K 2 (x k ,y)My) 



OUy 



dSy, (14) 



where <^ r eac(xfc) = ^i(xfc) — S'i(xfc), whose formulation is the integral part of Eq. (fT4"|) . is the reaction 
potential at the A;th solute atom. The electrostatic solvation energy, which can be regarded as the 
atomistic charge q k weighted average of the reaction potential <f> rcac , can effectively characterize the 
accuracy of a PB solver. 

2.4 GPU/CUDA implementation 

The GABI-PB solver uses the boundary integral formulation, which can be conveniently parallelized. 
The majority of the computing time is taken by the following subroutines. 

(1) Compute the source term in Eqs. (JHJ) and ([9]). 

(2) Perform matrix-vector product as in Eqs. (|12p and f)13|) on each GMRES iteration. 

(3) Compute electrostatic solvation energy. 

Among all of these subroutines, subroutine (2) is the most expensive one and is repeatedly computed 
on each GMRES iteration. We compute all of these routines in parallel to minimize the computation 
time. Table [Q provides the pseudo code for the overall computation. 

In this psuedo code, we divide all the operations to those on host performed by the CPU and 
those on device performed by the GPU. We use N c to denote the number of atoms of the biomolecule 
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Table 1: Pseudocode for GABI-PB solver using GPU 



1 


On 


host (CPU") 


2 




rpan nimnnlppnlp nata [pnaT'fJ'P and ^trnpf nrp ) 


3 




pall TVTXTYTS to cfPTipratp rviariPmlaf inn 


4 




copy biomolecule data and triangulation to device 


5 


On 


device (GPU) 


6 




pnph tnrpaH pnnpnrrpnrlv pnrn'nnfps ann kfotpk soiitpp fprirm tot a^mcnpH tnanclps 


7 




ponv sonrpp tprms nn Hpvipp to host 


8 


On 


host 


9 




spt initial pmpss Yn for C-rTVTRT^S itpvation anH ronv it to rlpvirp 


10 


On 


H pvi pp 


11 




paph thrpaH ronniTTPntlv roinnntps assipriprl spprnpnt of matri"X"-vpptor nrorlnrt v = Ax 


12 




copy the computed matrix-vector y to host memory 


13 


On 


host 


14 




test for GMRES convergence 


15 




if no, generate new x and copy it to device, go to step 10 for the next iteration 


16 




if yes, generate and copy the final solution to device and go to step 17 


17 


On 


device 


18 




compute assigned segment of electrostatic solvation energy 


19 




copy results in step 18 to host 


20 


On 


host 


21 




add segments of electrostatic solvation energy and output result 



and A?" to represent the number of triangular elements. In the following description, we include the 
memory use in a pair of parentheses following each variable we would have claimed. 

We first read in the atomistic coordinates (3N C ), radius (N c ) and charges (N c ) of the given 
biomolecule (step 2) and call the MSMS |31j program to generate triangulation (step 3) including the 
faces and the vertices of the triangulation. We convert these triangulation information to centroid 
coordinates (3N), normal direction vectors (3iV) and triangle areas (N), and copy them together with 
the atomistic coordinates, radii and charges to the device (step 4). So far, the device memory use 
is 5N C + 7N size-of-double. We then use multithreads on device to compute the source terms (2N, 
but deallocated right after being copied to host) in Eqs. ([8]) and ([9]) (step 6) and copy them to the 
host (step 7). Next, we set the initial guess xo of the solution (currently a zero vector) and copy it to 
device (step 9). Note that, when the GMRES is restarted (usually this needs to be done every 10-20 
steps to ensure the small number of terms in Krylov subspace expansion), we use the approximated 
solution achieved from the previous iteration as the initial guess xo- 

What follows is the major computing part. We compute the matrix- vector product y = Ax (step 
11) on multi-threads GPU device, and copy the resulting y to the host (step 12). These two steps 
take additionally 4iV size-of-double device memory. Then on host we test if the obtained y satisfies 
the GMRES convergence criterion (step 14) to decide whether to start the next iteration (step 15) or 
terminate the GMRES and generate the final solution (step 16). After GMRES convergence criterion 
is satisfied, we need to compute the atom-wise components of the electrostatic solvation energy on 
device using the final solution copied from the host (step 18). This step takes additional N c size-of- 
double device memory. Finally, with the components of the electrostatic solvation energy copied from 
device (step 19), we compute the total electrostatic solvation energy on host (step 21). All together, 
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we have allocated 6N C + UN size-of-double device memory. 

In implementing algorithms, we tune the CUDA code with some optimization strategies like loop 
unrolling, coalesced memory access (using double3 data type instead of three double variables for 
some structures like position and charge), and fast CUDA operators like "rsqrt". We also test and 
choose the optimized value 256 as the number of thread per block. Although these strategies did not 
substantially improve the performance, they contribute to the overall performance. 



3 Results 

In this section, we present the numerical results. We first solve the PB equation on a spherical 
cavity with multiple charges at different locations. The analytical solutions in terms of spherical 
harmonics are available [I]. We then solve the PB equation on a series of 24 proteins with different 
sizes and geometries. These protein structures are downloaded from protein data bank (PDB), and 
their charges and hydrogens are added with CHARMM22 force field [35j. We report the electrostatic 
solvation energy results and its associated execution time for solving the PB equation and computing 
electrostatic solvation energy on these proteins with and without GPU acceleration to demonstrate 
the improved efficiency from using GPU. All algorithms are written in C and CUDA and compiled 
by gcc with flag "-03" and nvcc with flag "-03 -arch=sm_20 -use_fast_math" . The simulations are 
performed with a single CPU (Intel(R) Xeon(R) CPU E5640 @ 2.27GHz with 2G Memory) on a 
12-core workstation and one GPU (Nvidia Tesla M2070) card. Before we reveal the numerical results, 
we define order and error. 



3.1 Order and Error 

In this paper, we report the relative error of the surface potential, which is defined as 

max \6 num (xi) - (b exa (xi)\ 

e - i=1 '-' N (15) 
^ max \6 exa (xi)\ ^ ' 

i=l,..., N 



where N is the number of unknowns, also the number of triangular elements of a particular discretized 
molecular surface. The notation (j) num represents numerically solved surface potential and 4> exa denotes 
the analytical solutions obtained by Kirkwood's spherical harmonic expansion [4]. The discretization 
on the molecular surface has a parameter "density" , number of vertices per A 2 . 
The numerical order of accuracy is computed with 

gCoarse 

Order = log coarse.mesh -^p (16) 

fine.mesh P Ilne 

following the convention of numerical analysis, where "mesh" refers to density for boundary integral 
methods at both coarse and fine levels. 



3.2 Accuracy tests on a spherical cavity 

We first perform the numerical tests for the accuracy of the algorithm on a spherical cavity, where the 
analytical solution in terms of spherical harmonics expansion is available [4] . The test case we designed 
is a sphere of radius r = 4A containing nine partial charges, which are located along the space curve 
r(t) =< cost, J- sin t, ^ > for t = 0, §,•••, 2tt. These nine point charges carry 0.1, 0.2, . . . , 0.9 
electric charges respectively in the units of e c , the elementary charge. We plot the charge locations in 
Fig. [H which resembles a segment of a biological helix. Table [2] reports the numerical results. In the 
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Figure 1: Locations of the partial charges inside a spherical cavity with radius r = 4 A 



first column of the table, we increase the MSMS input parameter "density" (number of vertices per 
A 2 ) by doubling its current value each time. The number of triangular elements are also approximately 
doubled each time as seen in column 2. The electrostatic solvation energy are reported in column 3 
and we can see a consistent pattern that these values are approaching the true value -952.52kcal/mol. 
Column 4 is the relative error of the surface potential, whose convergence pattern can be better 
seen from column 5 in terms of orders. Note that the 0.5th order observed here is relative to the area 
of the triangular element. If the 1-D length is considered, the order should be about one. 

Table 2: Accuracy tests on a spherical cavity: radius r=4A, nine charges are located on the space curve 
< cost, ^-sint i > for t = : | : 2n; is the relative error in surface potential 0; order is relative to 
the area of the elements; the exact value of E so i is -952.52 kcal/mol. 



density 


# of ele. 


E so \ 




order 


1 


370 


-971.92 


4.20E-02 




2 


736 


-968.05 


2.80E-02 


0.58 


4 


1572 


-964.03 


1.85E-02 


0.60 


8 


3124 


-961.08 


1.28E-02 


0.53 


16 


6308 


-958.75 


8.90E-03 


0.52 


32 


12772 


-956.99 


6.24E-03 


0.51 


64 


25494 


-955.74 


4.35E-03 


0.52 


128 


51204 


-954.81 


3.06E-03 


0.51 



3.3 Accuracy and efficiency test on proteins 

We next solve the PB equation and compute the electrostatic solvation energy on a series of proteins 
with different sizes and geometries. The numerical results are reported in Table El In this table, the 
first column is the index for the convenience of identification. The second row is the four-digit protein 
data bank (PDB) ID of the corresponding protein. Column 3 is the number of triangular elements 
when the molecular surfaces of these proteins are discretized. We uniformly choose "density=10" 
so that proteins with larger molecular surface areas as seen in column 5 will normally generate 
larger number of elements. A few exceptions happen when MSMS modifies the given density to fit its 
triangulation needs, resulting in a slightly mismatched order for the data in columns 3 and 5. Column 



4 gives the number of atoms of associated proteins. We can see in most of the cases the larger number 
of atoms results in the larger molecular surface area and number of elements. Column 5 lists the 
number of iterations. Prom this column, we noticed the GABI-PB solver converges within 20 steps 
on the majority of the proteins and the number of iterations does not increase with the increment of 
the number of elements, which contributes to the well-posed integral formulation. Occasionally, the 
number of iteration is high, for example for the 7th, 8th, and 24th proteins. This largely attributes the 
fact that the MSMS triangulations of these proteins have triangles with very small areas or very short 
sides. Currently we are working on a project to improve the triangulation. We also plot the number 
of iterations vs the number of elements in Fig. [2ja) on which the fact that most numbers of iterations 
are consistently low with a few exceptions is visualized. Column 7 reports the solvation energies. 
We compare these values with results from our previous work in [34] and consistency of results are 
observed. Column 8 shows the time for solving PB equation and computing the electrostatic solvation 
energy on one CPU (T± in seconds), Column 9 displays the time for that with GPU acceleration (T p 
in seconds), and Column 10 reports the ratio between them. From column 9, we can see most of the 
jobs are finished within 2-3 minutes except jobs on proteins with slow convergence (e.g. the 7th, 8th 
and 24th proteins). The ratio in the last column demonstrates the overall 120-150X parallel speedup. 

Table 3: Numerical results for computing the electrostatic solvation energy of 24 proteins: N is the number of 
elements, iV c is the number of atoms, iVjj is the number of iterations, area is the molecular surface area, E so i 
is the electrostatic solvation energy, T\ is the running time on one CPU, and T p is the running time with GPU 
acceleration. 



ID 


PDB 


N 


N c 


area(A 2 ) 


Nit 


i?soi(kcal/mol) 


CPU Ti(s) 


GPU T p (s) 


Ti/T p 


1 


lajj 


40496 


519 


2176 


8 


-1145.76 


1323 


11 


125 


2 


2erl 


43214 


573 


2329 


8 


-961.68 


1410 


11 


124 


3 


lcbn 


44367 


648 


2377 


8 


-307.03 


1486 


12 


129 


4 


lvii 


47070 


596 


2488 


11 


-915.18 


2501 


19 


129 


5 


lfca 


47461 


729 


2558 


8 


-1215.34 


1699 


13 


127 


G 


lbbl 


49071 


576 


2599 


10 


-1001.47 


2267 


17 


134 


7 


2pde 


50518 


667 


2727 


99 


-824.91 


25985 


194 


134 


8 


lshl 


51186 


702 


2756 


100 


-760.71 


27160 


201 


135 


9 


lvjw 


52536 


828 


2799 


8 


-1255.93 


2085 


16 


133 


10 


luxe 


53602 


809 


2848 


9 


-1154.60 


2437 


18 


136 


11 


lptq 


54260 


795 


2910 


10 


-883.77 


2778 


21 


130 


12 


lbor 


54629 


832 


2910 


12 


-863.91 


3657 


28 


132 


13 


lfxd 


54692 


824 


2935 


7 


-3344.56 


1963 


15 


129 


14 


lr69 


57646 


997 


3068 


9 


-1100.83 


2823 


22 


131 


15 


lmbg 


58473 


903 


3085 


9 


-1368.58 


2901 


22 


132 


1G 


lbpi 


60600 


898 


3247 


24 


-1320.89 


9004 


G4 


141 


17 


lhpt 


61164 


858 


3277 


11 


-825.59 


4248 


32 


133 


18 


451c 


79202 


1216 


4778 


19 


-1038.20 


11812 


87 


136 


19 


lsvr 


81198 


1435 


4666 


10 


-1731.54 


7294 


53 


138 


20 


lfrd 


81972 


1478 


4387 


9 


-2890.28 


5676 


41 


139 


21 


la2s 


84527 


1272 


4457 


16 


-1941.37 


11461 


83 


139 


22 


lneq 


89457 


1187 


4738 


19 


-1756.02 


15077 


106 


142 


23 


la63 


132134 


2065 


7003 


11 


-2404.07 


19804 


139 


143 


24 


la7m 


147121 


2809 


7769 


54 


-2184.05 


124326 


852 


146 



We also plot the results of the last three columns in Fig. [2j From Fig. EJ^b), we see the CPU time 
T\ is consistently a scale of 100+ of the GPU time T p . Time increases as the number of elements 
increases normally. Fig. [2(c) enables us to better observe that the speed-up increases as the number 
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of elements increases, which is advantageous for solving problems of larger sizes. 



(a) (b) (c) 




# of elements TV # of elements N # of elements N 



Figure 2: Numerical results for solving PB equation and computing electrostatic solvation energy on 
a set of 24 proteins: (a) number of iterations; (b) CPU time T\ and GPU time T p ; (c) the parallel 
speedup T\/T p . 

Next we focus on two proteins: lfrd and lsvr. Results of these two proteins in Table [3] show 
fast convergence with number of iterations around 10 and they have surface areas more or less of 
4500A 2 and number of elements is about 80,000 at "density=10" . To test the accuracy and efficiency 
of GABI-PB solver on these two proteins, we solve the PB equation and compute the electrostatic 
solvation energy at different "densities" ranging from 1 to 32 by doubling its value each time. The 
numerical results are shown in Table 01 We can see that as the densities are doubled each time, 
the number of elements are doubled approximately. The solvation energies at different densities are 
approaching to the values at the largest density 32. The running time T p with GPU acceleration 
increases at the rate of 0(N 2 ) after the number of elements are sufficiently large. Both cases show 
that, for solving PB equation and computing electrostatic solvation energy on molecular surfaces 
discretized with nearly 300,000 elements, the time required is only about 10 minutes. The number of 
iteration in Table 0] shows that the finer resolution for the same molecular surface will not increase 
the number of iterations. To further investigate the convergence of accuracy on proteins, we plot the 
electrostatic solvation energy on Fig. [3l By using cubic interpolation, we can see the electrostatic 
solvation energy computed for both proteins eventually converge toward its interpolated value (the 
red "*"). 

Table 4: Numerical results for solving PB equation and computing the electrostatic solvation energy on protein 
lfrd and lsvr. 





lfrd 


lsvr 


density 


# of ele. 


E so l 


GPU T p (s) 


# of it. 


# of ele. 


-Esol 


GPU T p (s) 


# of it. 


1 


10176 


-3360.69 


2 


18 


13974 


-2038.30 


2 


11 


2 


17710 


-3003.45 


2 


9 


22946 


-1833.41 


4 


10 


4 


34520 


-2931.05 


8 


9 


38116 


-1769.44 


10 


9 


8 


66294 


-2894.02 


28 


9 


71030 


-1735.37 


31 


9 


16 


134180 


-2880.90 


109 


9 


144168 


-1723.64 


140 


10 


32 


266702 


-2872.65 


423 


9 


284478 


-1716.94 


642 


11 
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Figure 3: Accuracy tests in terms for electrostatic solvation energy for proteins lfrd (a) and lsvr (b). 



4 Conclusion 

This paper describes a direct summation based GPU-accelerated boundary integral Poisson-Boltzmann 
(GABI-PB) solver. This solver discretizes the molecular surfaces with flat triangles and performs nu- 
merical integration with centroid collocation schemes. The numerical tests on spherical cavities show 
that GABI-PB solver can achieve 0.5th order convergence on surface potentials relative to the number 
of elements, which is a 1st order convergence relative to the length. The accurate surface potentials 
are of vital importance to molecular modelings that are sensitive to electrostatics near or on the 
molecular surface. Meanwhile, this direct-sum boundary integral implementation uses memory effi- 
ciently. It has been shown that we only need to allocate altogether 6iV c + lliV size-of-double device 
memory, where N c is the number of atoms and ./V is the number of triangular elements. In addition, 
numerical tests on a series of 24 proteins show fast convergence, consistent electrostatic solvation 
energy computation, as well as 120-150X speedup using a single GPU (Nvidia Tesla M2070) card. 

The major limitation of the direct-sum scheme is obviously the 0(N 2 ) computational cost, which 
becomes prohibitively expensive when the dimension of the problem increases to certain level. Even 
though using multiple GPU cards can offset some of this effect, it costs extra design, implementation, 
and hardware purchase. A remedy to this is applying the fast algorithms such as the 0{N dog N) 
treecode [17] and the much more complicated and memory-consuming 0{N) FMM [32], which are 
under our investigation. It is hard to deny that the adoption of fast algorithms is necessary for large 
sized problems. However, there is a region on which the direct-sum GABI-PB solver has advantages 
over boundary integral PB solvers with fast algorithms. This region is bounded by what we called 
critical point where the fast algorithms surpasses the direct sum. For example, the CPU implemen- 
tation of a treecode algorithm in solving boundary integral PB equation [T7] shows a critical point at 
about N = 4000 with p = 3 (the order of Taylor expansion) and MAC < 0.5 (multipole acceptance 
criterion, the ratio between cluster radius and the distance between target particle and the center 
of the cluster). In GPU implementation, the critical point will be much bigger in considering the 
communication and memory access. We will identify the value of critical point in our future work. 

Memory usage is a critical factor of GPU performance. For solving boundary integral PB, direct- 
sum use about 1/3 of memory of treecode [17] and 1/7 of memory of FMM [32]. The current GPU 
implementation is run on an available Tesla M2070 card. When we are considering to rerun all the 
tests on a Tesla M2090, Nvidia announced its release of Tesla K10 followed by K20 and K20x, showing 
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the rapid hardware update year after year. Taking M2070 and K20x as examples for comparison, the 
peak double precision floating points performance increases from 515Gflops to 1.31Tflops, the memory 
bandwidth increases from 150 GB/sec to 250/sec, the number of CUDA cores increases from 448 to 
2688. However the memory is still limited at 6Gbyte. These comparisons indicate the proposed direct- 
sum boundary integral PB solver, benefited from its low memory use, will demonstrate continuously 
improving performance with the flow of GPU hardware updates. 

In addition to including fast algorithms, there are many spaces in which GABI-PB can be improved 
and extended. For example, we are looking for better triangulation programs for the molecular 
surfaces [36j [371 EH] to avoid the slow convergence pattern for some proteins as seen in our tests. 
A more challenging problem is the application of GABI-PB to molecular dynamics |4H 142] . where 
the PB equation will be solved at every time sampling. For molecular surfaces discretized within 
50,000 elements, GABI-PB can resolve each sample in a few seconds or less. Furthermore, the GPU- 
accelerated boundary integral scheme has the potential to solve other integral equations such as the 
Helmholtz equation and Maxwell Equations. 
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